title: House prices prediction

author: Wittawat Muangkot

date: 2022-04-16

Description

This paper analyzes the House Prices dataset and builds a regression model to predict house sale prices, following the steps below:

Import essential packages

Collect Data

After checking the information on this dataset, we found missing values in some columns, so the data needs cleaning before model building. The dataset also has many features (train_df has 81 columns and test_df has 80), so we need feature selection to drop the low-significance features.

First, remove the Id column: every row has a unique value, so it carries no information useful for prediction.

Split train_df into train and test sets for model evaluation, using a 30% test size and a 70% train size.
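A minimal sketch of that split, using a toy stand-in for train_df (the real dataset has 81 columns; the fixed random_state is our own choice for reproducibility):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df with a SalePrice target column.
train_df = pd.DataFrame({
    "LotArea": range(100),
    "SalePrice": [100_000 + 1_000 * i for i in range(100)],
})

X = train_df.drop(columns=["SalePrice"])
y = train_df["SalePrice"]

# 70/30 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```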

Feature name correction

Some column names in this dataset start with a digit, so we replace the digits with text to prevent the program from misreading these columns when we reference them.
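The dataset's columns `1stFlrSF`, `2ndFlrSF`, and `3SsnPorch` are the ones starting with digits; a sketch of the renaming (the exact replacement names are our own choice):

```python
import pandas as pd

# Only the column names matter for this step.
df = pd.DataFrame(columns=["1stFlrSF", "2ndFlrSF", "3SsnPorch", "LotArea"])

# Map the leading digits to words so every name is a valid identifier.
rename_map = {
    "1stFlrSF": "FirstFlrSF",
    "2ndFlrSF": "SecondFlrSF",
    "3SsnPorch": "ThreeSsnPorch",
}
df = df.rename(columns=rename_map)
```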

Missing data check and Imputation

Let's start by checking the total number of missing values per column in the X_train dataset. To prevent leakage into the test data, we impute using values learned from X_train and apply the same transformation to the test set.

As the summary shows many columns with missing data, we separate them into two groups: columns with more than 5 percent missing and columns with less than 5 percent missing. The imputation method then depends on the group and the data type within it.

Next, separate each group into categorical and numerical subgroups.

Fill the missing values in the columns with more than 5 percent missing, for both the categorical and numerical subgroups.

Impute the columns with less than 5 percent missing using random-sample imputation fitted on the X_train dataset, then apply the same transformation to X_test.
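A hand-rolled sketch of random-sample imputation, filling gaps in both sets by sampling from values observed in X_train (the `GarageType` column and helper function are illustrative; libraries such as feature-engine offer the same idea ready-made):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

X_train = pd.DataFrame({"GarageType": ["Attchd", "Detchd", None, "Attchd", "BuiltIn"]})
X_test = pd.DataFrame({"GarageType": [None, "Detchd"]})

def sample_impute(train_col, col):
    """Fill NaNs in col by sampling (with replacement) from train_col's observed values."""
    observed = train_col.dropna()
    filled = col.copy()
    n_missing = filled.isna().sum()
    filled[filled.isna()] = rng.choice(observed.to_numpy(), size=n_missing)
    return filled

# Fit on X_train, then reuse X_train's values to transform X_test.
X_train["GarageType"] = sample_impute(X_train["GarageType"], X_train["GarageType"])
X_test["GarageType"] = sample_impute(X_train["GarageType"], X_test["GarageType"])
```

Sampling from the training column keeps the original category frequencies roughly intact, which is why the distributions barely change.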

As the comparison and visualization below show, the data distribution is barely disturbed by the sampling imputation.

Now let's look at the distribution of the target within each category of the original vs. imputed columns; as we can see, there is no difference after imputing with the random sampling method.

For the imputed numerical columns, we compare the original vs. imputed distributions; they show no difference.

Let's drop the original columns with missing values; we'll use only the imputed columns.

Apply the same cleaning process to the df_test dataset.

Feature Engineering

Referring to the dataset's data description file, we engineer new features from the features of interest, starting with the year features.
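A sketch of the year-derived features, assuming the dataset's standard columns `YrSold`, `YearBuilt`, `YearRemodAdd`, and `GarageYrBlt` (the new feature names are our own):

```python
import pandas as pd

df = pd.DataFrame({
    "YrSold":       [2008, 2007, 2010],
    "YearBuilt":    [2003, 1976, 2001],
    "YearRemodAdd": [2003, 2005, 2002],
    "GarageYrBlt":  [2003, 1976, 2001],
})

# Age-style features: years elapsed at the time of sale.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["GarageAge"] = df["YrSold"] - df["GarageYrBlt"]
# Remodel flag: 1 if the house was remodeled after construction.
df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)
```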

The chart above shows price differences across year-built groups: houses built in more recent years tend to command higher prices than older ones.

Visualize the median SalePrice of remodeled vs. non-remodeled houses.

As the chart above shows, remodeled houses have lower prices than houses that were never remodeled.

As the scatter plot above shows, price is inversely related to house age and garage age: the older the house, the lower the price.

Visualize the trend of sale prices by month and year, and by sale condition.

Visualize the relationship between house price and each YearBuilt group.

Grouping by YearBuilt lets us fit a linear model across the full range of the data.

Categorical Encoding

As the dataset has high cardinality and rare labels in its categorical features, we group the category labels with frequency below 5 percent into a Rare group.

First, show the object columns with fewer than 9 unique values and find the labels whose frequency is below 5 percent.

Group the labels with low frequency into the Rare group
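A sketch of the rare-label grouping, computed on X_train (the `RoofStyle` column and its counts are made up for illustration):

```python
import pandas as pd

X_train = pd.DataFrame({
    "RoofStyle": ["Gable"] * 60 + ["Hip"] * 35 + ["Shed"] * 3 + ["Mansard"] * 2
})

def rare_labels(col, threshold=0.05):
    """Labels whose relative frequency in the training data is below the threshold."""
    freqs = col.value_counts(normalize=True)
    return set(freqs[freqs < threshold].index)

rare = rare_labels(X_train["RoofStyle"])
# Keep frequent labels; replace everything else with "Rare".
X_train["RoofStyle"] = X_train["RoofStyle"].where(
    ~X_train["RoofStyle"].isin(rare), "Rare"
)
```

The same `rare` set learned from X_train should be reused when transforming X_test and df_test, so all splits share the same labels.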

With the new Rare group we reduce the number of labels in each column, which also reduces the number of features produced by one-hot encoding.

For the other features with very high cardinality, we do the same: keep the features with fewer than 9 unique values and group the labels below 5 percent into Rare.

For the columns with ordered categories in the X_train dataset, we encode with OrdinalEncoder, then transform X_test and df_test with the fitted encoder.
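A sketch with scikit-learn's OrdinalEncoder; the Po < Fa < TA < Gd < Ex quality scale matches the dataset's data description, though the choice of `ExterQual` here is just an example:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "TA", "Fa"]})
X_test = pd.DataFrame({"ExterQual": ["Gd", "TA"]})

# Explicit category order: poor -> excellent, so codes respect the ranking.
encoder = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])

# Fit on X_train only; reuse the fitted encoder on the test data.
X_train["ExterQual"] = encoder.fit_transform(X_train[["ExterQual"]]).ravel()
X_test["ExterQual"] = encoder.transform(X_test[["ExterQual"]]).ravel()
```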

The other categorical features are encoded with one-hot encoding.
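One common way to one-hot encode, sketched with pandas `get_dummies` (scikit-learn's OneHotEncoder is an alternative that is easier to reuse across splits); aligning the test columns to the train columns guards against categories missing from one split:

```python
import pandas as pd

X_train = pd.DataFrame({"SaleType": ["WD", "New", "WD", "COD"]})
X_test = pd.DataFrame({"SaleType": ["WD", "WD"]})

# drop_first=True avoids one redundant dummy column per feature.
X_train = pd.get_dummies(X_train, columns=["SaleType"], drop_first=True)

# Encode the test set, then align its columns with the train set,
# filling categories unseen in the test data with 0.
X_test = pd.get_dummies(X_test, columns=["SaleType"])
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```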

Data Transformation and scaling

Building a regression model generally assumes a linear relationship between the features and the target. Since this dataset has skewed features, some outliers, and widely differing ranges, we scale the data and transform it toward a normal distribution before moving on to model building.
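One possible transform-then-scale recipe, sketched here with a log transform for skew followed by standardization (Yeo-Johnson via scikit-learn's PowerTransformer is an alternative; the `LotArea` values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"LotArea": [8450, 9600, 11250, 4000, 215000]})
X_test = pd.DataFrame({"LotArea": [10000, 7000]})

# log1p compresses the long right tail of skewed, non-negative features.
X_train["LotArea"] = np.log1p(X_train["LotArea"])
X_test["LotArea"] = np.log1p(X_test["LotArea"])

# Scale with statistics learned on the training set only.
scaler = StandardScaler().fit(X_train[["LotArea"]])
X_train["LotArea"] = scaler.transform(X_train[["LotArea"]]).ravel()
X_test["LotArea"] = scaler.transform(X_test[["LotArea"]]).ravel()
```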

The chart below shows the distributions, outliers, and the linearity assumption for each feature.

Feature Selection

Before building the prediction model, we select the features that are most useful for prediction. To measure the effect of feature selection, we compare the dataset's score before and after using neg_mean_squared_error, where a smaller error indicates better model performance.

We check for features that contain a constant value and remove them to improve model performance.

We notice that 15 features contain constant values, so we remove them from our dataset.
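The constant-feature check can be sketched with `nunique` (scikit-learn's VarianceThreshold is an alternative); the toy columns below are illustrative:

```python
import pandas as pd

X_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "PoolFlag": [0, 0, 0],               # constant: carries no information
    "Street": ["Pave", "Pave", "Pave"],  # constant categorical column
})

# A column with a single unique value cannot help the model discriminate.
constant_cols = [c for c in X_train.columns if X_train[c].nunique() == 1]
X_train = X_train.drop(columns=constant_cols)
```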

After removing the constant features, performance increases slightly.

Check for duplicated features.
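One way to find columns that duplicate each other, sketched on toy data: transpose so pandas' `duplicated` compares whole columns, then drop every duplicate after the first.

```python
import pandas as pd

X_train = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1, 2, 3],  # exact duplicate of A
    "C": [4, 5, 6],
})

# duplicated() on the transpose flags columns identical to an earlier one.
duplicated = X_train.T.duplicated()
dup_cols = duplicated[duplicated].index.tolist()
X_train = X_train.drop(columns=dup_cols)
```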

Check the correlations between features in the dataset.
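A sketch of the correlation check, flagging highly correlated pairs (the 0.8 threshold and the toy values are our own; in the real dataset, pairs like GrLivArea and TotRmsAbvGrd tend to correlate):

```python
import pandas as pd

X_train = pd.DataFrame({
    "GrLivArea":    [1710, 1262, 1786, 1717],
    "TotRmsAbvGrd": [7, 5, 7, 7],
    "LotArea":      [8450, 9600, 11250, 9550],
})

corr = X_train.corr().abs()
# Pairs above the threshold are candidates for dropping one of the two.
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and corr.loc[a, b] > 0.8]
```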

Next, we compare three feature-selection methods and pick the best one.

After the comparison, we use GradientBoosting, which has the best score at 91.5% with 15 features remaining, setting a mask to filter the unhelpful features out of the dataset.
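A sketch of model-based selection with gradient-boosting feature importances; SelectFromModel produces exactly the kind of boolean mask described above (synthetic data stands in for the housing features):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic regression problem: 20 features, only 5 informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

selector = SelectFromModel(GradientBoostingRegressor(random_state=0))
selector.fit(X, y)

mask = selector.get_support()       # boolean mask over the 20 features
X_selected = selector.transform(X)  # keeps features above the importance threshold
```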

Modeling

Compare models to select the algorithm with the best score for this dataset, using RandomForestRegressor, LinearRegression, Ridge, and GradientBoostingRegressor.

The best algorithm is GradientBoostingRegressor with a score of 91.016%, followed by RandomForestRegressor with 90.779%.
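The comparison above can be sketched as a loop scoring each candidate with cross-validation (synthetic data stands in for the housing dataset, and the default R-squared scorer is an assumption):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validation score for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
best = max(scores, key=scores.get)
```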

Hyperparameter tuning

To improve model performance, we tune the hyperparameters of RandomForest and GradientBoosting with a randomized grid search, which samples from the hyperparameter ranges we select and keeps the best-scoring combination.
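A sketch of randomized search with scikit-learn's RandomizedSearchCV; the parameter ranges, `n_iter`, and synthetic data are our own choices:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Distributions/lists to sample hyperparameters from.
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
    "learning_rate": [0.01, 0.05, 0.1],
}

# Try 5 random combinations; keep the best by cross-validated score.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```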

After tuning, the best score from GradientBoostingRegressor is 91.44% with a mean absolute error of 15,724.85, meaning the model's predicted prices deviate from the actual housing prices by about +/- 15,724.85 on average.

Fit a linear line to visualize how well the model fits the dataset.

Save the model and use it to predict on the df_test dataset.
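A sketch of persisting and reloading the fitted model with joblib (the file name is our own choice; a fitted toy model stands in for the tuned one):

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Persist the fitted model to disk, then reload it for prediction.
joblib.dump(model, "house_price_model.joblib")
loaded = joblib.load("house_price_model.joblib")
preds = loaded.predict(X)  # in the real workflow, predict on df_test
```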